Tidy data

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • Introduction
  • Untidy data
  • Missing data
  • Separate data
  • Codes and codebooks

Warm up

R learning curve

Warm up

Warm up

Load packages.

library(dplyr)
library(ggplot2)
library(readr)
library(tidyr) # <- tidytidytidy
library(countrycode)

Introduction

Introduction

“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.

Source: R for Data Science

Introduction

A df is tidy if it fulfills these requirements (Wickham 2014):

  • Each df has one unit of observation.
  • Observations represented in the rows.
  • Variables represented in the columns.
  • Each cell indicates a value.

Source: R for Data Science

Introduction

But …

Votes to parties in the 2019 Spanish elections

# A tibble: 15 × 9
   provincia municipio           votos…¹    pp    cs podem…² mas_p…³    eb pacma
   <chr>     <chr>                 <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>
 1 Almería   Abla                    710   193    47      30       0     2     3
 2 Almería   Abrucena                699   111    45      42       0     0     2
 3 Almería   Adra                  10941  3126   574     645       0    16    81
 4 Almería   Albánchez               272    56    41      18       0     0     2
 5 Almería   Alboloduy               419   155    17      10       0     0     2
 6 Almería   Albox                  4335  1261   378     247       0     2    28
 7 Almería   Alcolea                 509    92    39      43       0     1    11
 8 Almería   Alcóntar                371   125    20      15       0     2     3
 9 Almería   Alcudia de Monteag…     106    47     5       6       0     0     0
10 Almería   Alhabia                 409    95    35      23       0     0     1
11 Almería   Alhama de Almería      1984   402   103     182       0     0    18
12 Almería   Alicún                  150    38    11       7       0     0     3
13 Almería   Almería               92383 24495  7848    9699       0   204   964
14 Almería   Almócita                103    10     6      16       0     1     0
15 Almería   Alsodux                  81    16     1       9       0     3     3
# … with abbreviated variable names ¹​votos_validos, ²​podemos_iu, ³​mas_pais

Introduction

But …

Minor contracts granted in Barcelona

# A tibble: 15 × 5
   ens                        contracte       data       proveidor        import
   <chr>                      <chr>           <date>     <chr>             <dbl>
 1 CONSORCIS                  Serveis         NA         Marc Redorta      1646.
 2 CONSORCIS                  Serveis         NA         JESUS SANZ LOPEZ   360 
 3 CONSORCIS                  Serveis         NA         JOAN BOSCH MUNT…   690.
 4 CONSORCIS                  Serveis         NA         ALLIANZ, Compañ…   456.
 5 FUNDACIONS I ASSOCIACIONS  <NA>            NA         Vopi4Elecnor      1891.
 6 FUNDACIONS I ASSOCIACIONS  Serveis         2918-11-19 Jaume Badosa      1754.
 7 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         WHADS MEDIA STU…  1392.
 8 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         AJUNTAMENT BARC…  4655.
 9 ORGANISMES AUTÒNOMS LOCALS Subministrament NA         INADHOC HABITAT   9859.
10 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         DOT CONSULTING   18090.
11 ORGANISMES AUTÒNOMS LOCALS Subministrament NA         SAYTEL           15562.
12 ORGANISMES AUTÒNOMS LOCALS Subministrament NA         ID GRUP           1197.
13 ORGANISMES AUTÒNOMS LOCALS Subministrament NA         ID GRUP           2117.
14 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         CODINA, ESTHER    1600 
15 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         ABELEIRA, ANTON…  1200 

Introduction

But …

Price of rents

# A tibble: 15 × 3
   nom_barri                              preu preu_m2
   <chr>                                 <dbl>   <dbl>
 1 el Raval                               590.   10.8 
 2 el Barri Gòtic                         713.   10.6 
 3 la Barceloneta                         541.   14.4 
 4 Sant Pere, Santa Caterina i la Ribera  673.   11.0 
 5 el Fort Pienc                          736.   10.4 
 6 la Sagrada Família                     673.   10.6 
 7 la Dreta de l'Eixample                 921.    9.84
 8 l'Antiga Esquerra de l'Eixample        828.   10.4 
 9 la Nova Esquerra de l'Eixample         716.   10.3 
10 Sant Antoni                            693.    9.77
11 el Poble Sec                           568    10.2 
12 la Marina del Prat Vermell              NA    NA   
13 la Marina de Port                      554.    8.34
14 la Font de la Guatlla                  632.   10.8 
15 Hostafrancs                            581.   10.4 

Income per capita

# A tibble: 15 × 3
   nom_barri    sc import_euros
   <chr>     <dbl>        <dbl>
 1 el Raval      1        27503
 2 el Raval      2        21913
 3 el Raval      3        24220
 4 el Raval      4        27405
 5 el Raval      5        23014
 6 el Raval      6        25581
 7 el Raval      7        25959
 8 el Raval      8        21900
 9 el Raval      9        24849
10 el Raval     10        23306
11 el Raval     11        21571
12 el Raval     12        25592
13 el Raval     13        27908
14 el Raval     14        26078
15 el Raval     15        26092

Introduction

Data is untidy because is:

  • Variables are not represented in the columns.
  • Missing: Cells do not contain a value.
  • Separate: Observations are in many datasets.

Tidy data

Untidy datasets?

Number of TB cases documented by the WHO in Afghanistan, Brazil, and China between 1999 and 2000 (cases & population).

Table A

# A tibble: 6 × 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Table B

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Table C

# A tibble: 12 × 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Table D

# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

A tidy dataframe

NOO ANOTHER

Correlates of War Intra-State War Data

# A tibble: 442 × 4
   WarName            WarType SideA          SideB                           
   <chr>                <dbl> <chr>          <chr>                           
 1 First Caucasus           5 Russia         Georgians, Dhagestania, Chechens
 2 Sidon-Damascus           6 Sidon          Damascus & Aleppo               
 3 First Two Sicilies       4 Austria        -8                              
 4 First Two Sicilies       4 Two Sicilies   Liberals                        
 5 Spanish Royalists        4 Spain          Royalists                       
 6 Sardinian Revolt         4 Austria        -8                              
 7 Sardinian Revolt         4 Sardinia       Carbonari                       
 8 Greek Independence       5 Ottoman Empire Greeks                          
 9 Greek Independence       5 -8             United Kingdom                  
10 Greek Independence       5 -8             France                          
# … with 432 more rows

An untidy dataframe

PEW Research Religious Landscape Study1

# A tibble: 18 × 6
   religion                `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k`
   <chr>                     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Agnostic                     27        34        60        81        76
 2 Atheist                      12        27        37        52        35
 3 Buddhist                     27        21        30        34        33
 4 Catholic                    418       617       732       670       638
 5 Don’t know/refused           15        14        15        11        10
 6 Evangelical Prot            575       869      1064       982       881
 7 Hindu                         1         9         7         9        11
 8 Historically Black Prot     228       244       236       238       197
 9 Jehovah's Witness            20        27        24        24        21
10 Jewish                       19        19        25        25        30
11 Mainline Prot               289       495       619       655       651
12 Mormon                       29        40        48        51        56
13 Muslim                        6         7         9        10         9
14 Orthodox                     13        17        23        32        32
15 Other Christian               9         7        11        13        13
16 Other Faiths                 20        33        40        46        49
17 Other World Religions         5         2         3         4         2
18 Unaffiliated                217       299       374       365       341

Pivoting data

We change the rows and columns of the dataframe keeping the same information.

  • Pivot longer
pivot_longer(df, cols, names_to, names_to)
  • Pivot wider
pivot_wider(df, names_from, values_from)
  • Separate
separate(df, col, into, sep)
  • Unite
unite(df, col, ..., sep)

Join data

Join data

Classifications

Missing data

Why they are missing?

NA (Not Available)

Related with the state capacity to collect them (Stone2008?)

Check the World Bank

Small or underdeveloped states

is.na()

Codebooks

Bibliography

Wickham, Hadley. 2014. Tidy Data.” Journal of Statistical Software 50 (10): 1–23.